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Abstract 

Bayesian optimization has shown to be a fundamental global optimization algo¬ 
rithm in many applications: ranging from automatic machine learning, robotics, 
reinforcement learning, experimental design, simulations, etc. The most popular 
and effective Bayesian optimization relies on a surrogate model in the form of a 
Gaussian process due to its flexibility to represent a prior over function. However, 
many algorithms and setups relies on the stationarity assumption of the Gaussian 
process. In this paper, we present a novel nonstationary strategy for Bayesian op¬ 
timization that is able to outperform the state of the art in Bayesian optimization 
both in stationary and nonstationary problems. 


1 Introduction 


Many problems in engineering, computer science, economics, etc., require to find the extremum of 
a real valued function. These functions have typically nice properties from a numerical point of 
view, like being continuous and sometimes smooth (e.g.: Lipschitz continuous). However, many 
of those functions represent costly processes, expensive trials or several time consuming compu¬ 
tations. Furthermore, those functions might not have a closed-form expression or might be highly 
multimodal. 

Bayesian optimization, although being a classic method lfl3ll . has become quite popular recently for 
being a sample efficient method of global optimization 0. Recent works have found connections 
with Bayesian optimization and the way biological systems adapt and search, such as human active 
search m or animal adaptation to injuries 0. In machine learning, it has been applied for automatic 
algorithm tuning ED and reinforcement learning Il2l . It is specially suitable for these kind of 
expensive black-box functions and trial-and-error methodologies for using a Bayesian surrogate 
model, that is, a distribution over target functions P(f). This surrogate model has a twofold purpose: 


• It can be updated recursively as outcomes are available from the evaluated trials y, = /(x,) 


^(/|Xl:i,yi:i) = 


V i = 2... n 


( 1 ) 


P( x i,yi) 

• It allows a best response analysis for the decision/action a of selecting the next trial x ra+1 : 


a BO = argmin / 5 n {f, a) dP(f |xi :n ,yi :n ) 
a Jf 


( 2 ) 


where S n (f, a) is the optimality criteria of regret function that drives the optimization towards the 
optimum x*, such as the optimality gap S n (f, a) = / (x„) — /(x*), the Euclidean distance error 
8 n (f, a) = ||x„ — x *||2 or the relative entropy S n (f, a) = JT(x*|xi :n _i) — iT(x*|xi :n ). The first 
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condition allows to make the decisions of the second condition with all the information available at 
that time, thus improving the search of the optimum. 

Without loss of generality, consider the problem of finding the minimum of an unknown real valued 
function / : X — >• R, where X is a compact space, X C d > 1. For the remainder of the paper, 
we are going to assume that the surrogate model P(f) is a Gaussian process £(x)with inputs x € X 
and an associate kernel or covariance function kf, •). One advantage of using Gaussian processes 
(GPs) as a prior distributions over functions is that new observations of the target function (x ,, yf) 
can be easily used to update the distribution over functions. Furthermore, the posterior distribution 
is also a GP: £; = [£(-)|ari ; j, yi;*]. Therefore, the posterior can be used as an informative prior for 
the next iteration in a recursive algorithm. 

The following algorithm summarizes the steps in Bayesian optimization. Note that, without loss of 
generality, for the remainder of the paper we are going to assume a Gaussian process with MCMC on 
the hyperparameters as surrogate model and the expected improvement as the acquisition function. 
However, the proposed algorithms also work with other popular models such as Student-t processes 
mm, a discrete representation of the hyperparameters flOl or different acquisition functions such 
as upper confidence bound ED or relative entropy 00, among others. 


Algorithm 1 Bayesian optimization (BO) with MCMC 


1 

for n = 1... N do 


2 

X 4— xi :n _i y 4— yi- n -i 


3 

for * = 1... to do 

> We have m MCMC samples 

4 

Pi 4- k(x,X|6» i )K(X,X|0 i )y 

> Predicted mean 

5 

o-i <- fc(x,x|0O -k(x,X|0 i )K(X,X|0 i )k(X,x|0 i ) 

> Predicted variance 

6 

end for 


7 

0 4— {9f} r f =1 > Update the hyperparameters using slice sampling 

8 

x„ = arg max Xn £A=i [(y best - Pi) $(2») + a iH z i)\ 

> Expected improvement 

9 

y n /(x„) 


10 

end for 



The main contribution of the paper is an algorithm for improved Bayesian optimization using a com¬ 
bination of local and global kernels to achieve a nonstationary behavior, called Spartan Bayesian 
Optimization. Although it reaches its best performance in problems that are intrinsically nonsta¬ 
tionary, our evaluation shows that it can improve the results of Bayesian optimization in many se¬ 
tups. Additionally, we also present a method to actively learn the Gaussian process hyperparameters 
while conducting Bayesian optimization with two alternatives: using explicit exploration driven by 
information gain and using simultaneous perturbations to numerically estimate the variability of the 
model. Finally, we add some comments about using hierarchical Bayesian optimization to deal with 
complex input spaces. 

2 Nonstationarity in Gaussian processes 

Many applications of Gaussian process regression, including Bayesian optimization, are based on 
the assumption that the process is stationary and, in many times, isotropic. For example, the use of 
the isotropic squared exponential kernel in GPs is quite frequent: /cs_e(x, x') = exp((x x' ) 7 A(x- 
x')), where A = 6 l 1 1 and (1; represents the length-scale hyperparameter that captures the smooth¬ 
ness or variability of the function. That is, small values of 9i will be more suitable to capture signals 
with high frequency components; while large values of 9i result in model for low frequency signals 
or flat functions. 

This property results in an interesting behavior in Bayesian optimization. For the same distance 
between points, a kernel with smaller length-scale will result in higher variance, therefore the ex¬ 
ploration will be more aggressive. This idea has been explored previously in 1(221 by forcing smaller 
scale parameters to improve the exploration. More formally, in order to achieve no-regret conver¬ 
gence to the minimum, the target function must be an element of the reproducing kernel Hilbert 
space (RKHS) characterized by the kernel k (•, ■ ) EJED- For a set of kernels like the SE or Matem, 
it can be shown that, given two kernels ki and k s with large and small length scale hyperparameters 
respectively, any function f in the RKHS characterized by a kernel ki is also an element of the RKHS 
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Figure 1: Weighting function of the local and global kernels for nonstationary Bayesian optimiza¬ 
tion. The global kernel (in red) guarantees full support over the whole input space. The weight 
of local kernel (in blue) is based on a narrow Gaussian which can be moved through the space, 
therefore focusing on the region near the minimum value. 


characterized by k s E). Thus, using k s instead of /.:/ is safer in terms of guaranteeing convergence. 
However, if the small kernel is used everywhere, it might result in unnecessary sampling of smooth 
areas. 

Nevertheless, most of the applications where Bayesian optimization is used are heterotropic and 
nonstationary. For example, we may have an heteroscedastic function with different variability, 
frequency or smoothness in different regions. Thus, these functions require kernels with different 
length scales for those regions. Take for example a reinforcement learning problem where many 
policies might result in a failure condition, returning a similar null reward. Therefore, the reward 
function is almost flat except for a set of parameters where it actually varies. 

There has been several attempts to model nonstationary functions with Gaussian processes. The 
most popular is the use of specific nonstationary kernels m. Bayesian treed GP models 0 or 
projecting the input space to a stationary latent space na. Recently, a version of the later idea has 
been applied to Bayesian optimization 

Our approach to nonstationarity, the Spartan Bayesian Optimization algorithm, is based on the 
model presented in ifTOl where the input space is partitioned in different regions such as the result¬ 
ing GP is the linear combination of local GPs: £(x) = JN Aj(x)£j(x). Each local GP has its own 
specific hyperparameters, making the final GP nonstationary even when the local GPs are stationary. 
In order to achieve smooth interpolation between regions, Krause and Guestrin Col suggest the use 
of a weighting function z'i(x) for each region, having the maximum in region i and decreasing its 

value with distance to region i. Then, we can set A,;(x) = y 

For Bayesian optimization, we suggest the combination of a local and global kernel with multivariate 
normal distributions as weighting functions as presented in Figure [T] Assuming that the input space 
is bounded to the [0, \} d hypercube, which can be achieved by rescaling the original problem, we 
consider that for each dimension: 

^global = -A/"(0-5) 10); ^=^2,0.05) V k = l...d (3) 

where {0p2}f is considered to part of the set of hyperparameters of the surrogate model that are 
learned accordingly when new data is available 0 = {9 pos , 6*; oca j, 0 g ; o f, a /}. In that way, the position 
of the local kernel is adapting towards to the area near the minimum or other important area. 

The intuition behind this setup is the same of many adquisition functions in Bayesian optimization: 
the aim of the surrogate model is not to approximate the target function precisely in every point, 
but to provide information about the location of the minimum. For example, the resulting model 
could “flat out” most of the search space, as soon as the region near the minimum have the correct 
variability. Many optimization problems are difficult due to the fact that the region near the minimum 
has higher variability that the rest of the space, like the function in Figure[2] However, it is important 
to note that the kernel hyperparameters are initialized with the same prior for the local and global 
kernel. Thus, there is guarantee that the local kernel became the kernel with smaller length-scale. 
Depending on the data captured, it could learn a model where the local kernel has larger length-scale 
(i.e.: smoother) than the global kernel. 
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Algorithm 2 Spartan Bayesian Optimization (SBO) with MCMC 
1: for n = 1... N do 

2: X •<— x 1;n _!; y -f- t/i :n _i; 

3: for i = 1... to do > We have m MCMC samples 

4: ft(x,x'|6»j) «- Az(x|6»f os )Ai(x'|6if os )fc;(x,x'|^) + A g (x)A g (x')fc g (x,x'|0f) 

5: /ij ■<—k(x, X|0j)K(X, X|0j)y > Predicted mean 

6: a* t— A:(x, x|0*) — k(x, X|0j)K(X, X|0j)k(X, x| Of) > Predicted variance 

7: end for 

8: 0 <- {^global, o[ l Q \ aV 0pos}™ 1 > Update the hyperparameters using slice sampling 

9: x„ = argmax Xn Y%Li l(ybest ~ Vi) ®{zi) + Vi<l>{zi)] > Expected improvement 

10: y„ <5- /(x„) 

11: end for 



Figure 2: Results for the exponential 2D function. Center: we can see the optimization gap. Nonsta¬ 
tionary methods, such as WARP and the proposed method SBO result in an improved convergence. 
The sugested method was able to find the minimum always in less than 20 iterations. Right: Total 
optimization time in seconds. 


3 Active hyperparameter learning during Bayesian optimization 

As commented previously, learning accurate kernel hyperparameters is of paramount importance in 
Bayesian optimization, specially in the case of nonstationary models, where the number of hyperpa¬ 
rameters increases considerably. On the other hand, Bayesian optimization algorithms are rooted on 
the idea of having few samples. Therefore, learning those hyperparameters becomes a challenging 
problem. 

In this section, we present a method to actively learn the kernel hyperparameters while doing opti¬ 
mization, by explicitly exploring areas with high information gain over the hyperparameters. 

3.1 Active hyperparameter learning through Information Gain 

In order to improve the quality of the surrogate model, we will try to reduce the entropy of the hyper¬ 
parameters H(Q). Implicitly, this is done already in Bayesian optimization. Many popular criteria 
such as the expected improvement, the upper confidence bound, etc., include an exploration term 
which tries to minimize the predicting variance of the surrogate model. The information never hurts 
principle shows that any exploration strategy will, in expectation, decrease H(Q). See Proposition 
8 of iflOl for details. 

Krause and Guestrin fna also suggest a criterion for explicit exploration based on the information 
gain (IG) of the GP hyperparameters. Following that criterion, the decision for the next point to 
evaluate will be: 


C/G = tf(0|yi:„) - H(e |yi:„,y„+i) = argmax H{y n+l \y 1 , n ) - ff(y„ + i|yi :n , 0) 

X rr + 1 


~ - l0 S S ~ ^ + a i 

U=i 


m 

-b^iogof 
i=1 


(4) 
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where fi = JA 11 ,/m. Any other sampling distribution could be used for 0 adding the corresponding 
weight to each sample. For example, in the original work Col, a discrete distribution is used. 

The problem with a jq is that it is purely exploratory criterion, which is not intended for opti¬ 
mization. Thus, it needs to be combined with other optimization criterion. We suggest a linear 
combination with an annealing coefficient to gather information about the hyperparameters 0 at 
the beginning, and focusing on finding the minimum x* after several iterations. For example, the 
expected improvement with explicit information gain criterion is defined as: 


m 

CEIIG = V [(y best - Mi) $0i) + (Ti<j>{Zi)\ + 

Z ' n z 

i—1 



m 

[(^ “ V? + a i 


+E lo § 

i= 1 


(5) 


where Zi = {ybest — Mi)/ 17 *- The functions <&(•) and </>(■) are the normal cdf and pdf respectively. 
The coefficient a/n 2 represents the importance information gain component with respect to the 
expected improvement. Also, the coefficient is annealed to increase the effect of the IG early on to 
help learning the kernel hyperparameters. Eventually a/n 2 —» 0, resulting in the classical expected 
improvement once the hyperparameters are good enough. 


3.2 Simultaneous perturbation Bayesian optimization 

Analyzing the shape of the information gain criterion we found that, in many cases, the highest value 
was near a previous sample. We concluded that the leght-scale hyperparameter is related to the vari¬ 
ability of the function, which, at the same time, it is related to the local gradient. Therefore, in a 
certain way, the information gain criterion was estimating the gradient numerically to obtain infor¬ 
mation of the variability of the function. It is known that adding gradient information can improve 
considerably the accuracy of Gaussian process regression EE However, in many applications, the 
gradient is not available. Furthermore, using finite difference methods require many samples, which 
is against the philosophy of Bayesian optimization. 

In the field of stochastic optimization, the simultaneous perturbation stochastic approximation al¬ 
gorithm (SPSA) «20j is a variation of the classic finite difference algorithm for high dimensional 
spaces. In this method, the gradient is approximated by finite differences of a small subset of ran¬ 
domly sampled perturbations. The beauty of the SPSA algorithm is that, although the gradient is not 
perfectly estimated, the algorithm is guaranteed to converge to the local optimum. 


Algorithm 3 Simultaneous Perturbation Bayesian Optimization (SPBO) 

1 : for i = 1. .. N/2 do > Generates two samples per iteration 

2: Xj <r- Computed with Algorithm[I] 

3: if i < T then > We compute perturbations only at the beginning, during exploration. 

4: A x perturbation vector. Each component sampled from ail Bernoulli distribution 

5: Xj <- Xj + //A x 

6: end if 

7: end for 


We present an algorithm called Simultaneous Perturbation Bayesian Optimization, which, for every 
sample using Bayesian optimization, also computes a perturbed sample. Theoretically, this algo¬ 
rithm reduce the computational burden of Bayesian optimization and information gain inasmuch as 
it is applied half of the iterations. Compared to Bayesian optimization, the cost of sampling a ran¬ 
dom perturbation is negligible with respect to the cost of updating the Gaussian process model and 
computing the maximum information gain. Also, note how the perturbation can be computed only 
in the first part of the optimization process, similar to the annealing process in the previous section. 

4 Hierarchical Bayesian optimization 

In many Bayesian optimization applications, it is becoming a common trend to simultaneously opti¬ 
mize different kinds of variables, for example: continuous, discrete, categorical, etc. While Gaussian 
processes are suitable for modeling those spaces, Bayesian optimization can become quite involved 
in line [8] of Algorithm [T] as the acquisition function must be optimized in the same input space. 
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Figure 3: Hartmann 6D function. Left: Minimum value obtained. Center: zoom of the minimum 
value in the last iterations. Right: time in seconds. 


Available implementations of Bayesian optimization like Spearmint lfl8l use grid sampling and 
rounding tricks to combine different input spaces. However, this might reduce the quality of the 
final result lITTl compared to proper optimization methods. 

In this section, we suggest to use a hierarchical Bayesian optimization model, where the input space 
is partitioned between homogeneous variables, for example: continuous variables X'U and discrete 
variables x w L Therefore, the evaluation of an element higher in the hierarchy implies the full 
optimization of the elements lower in the hierarchy. In principle, that would require much more 
function evaluations but, as the input space has been partitioned, the dimensionality of each separate 
problem is much lower. In practice, for the same number of function evaluations, the computational 
cost is considerably reduced. 


Algorithm 4 Hierarchical Bayesian Optimization (HBO) — Total budget N = N\ ■ N 2 

1 

2 

3 

4 

5 

6 

7 

8 
9 

10 
11 
12 


x = [x( c ), x ur> \ > We separate the discrete and continuous components 

for n = 1... N c do > Outer loop - Continuous optimization 

Update model and kernels hyperparameters for x^ 

(c) 

Find continuous component of next point x„ 

for k = 1... N,i do > Inner loop - Discrete optimization 

Update model and kernels hyperparameters for x irr> 


Find discrete component of next point x 


(d) 


r / r (c 

yfc /([Xn 

end for 

y„ «- f([x { n\r* (d) 

end for 

X* = [x*W,x*W] 


f]) 

]) 


> Update x 


■(d) 


> Update x*( c ) and save corresponding x*^ 


An advantage of this approach is that we can combine different algorithms for different levels of the 
hierarchy. For example, using Random Forests j8j could be more suitable as a surrogate model for 
certain discrete/categorical variables than Gaussian processes. In contrast, we lose the correlation 
among variables in the inner loop, which might be counterproductive in certain situations. 

5 Evaluation and results 

For the evaluation of the suggested algorithms we have implemented them using the popular 
BayesOpj^J library fill . This allowed us to compare our proposal with a large variety of surro¬ 
gate models, kernels, etc. For example, the results presented in this sections are based on the stan¬ 
dard convention in Bayesian optimization literature, that is, a simple zero-mean Gaussian process, 
a Matern kernel 5/2 with automatic relevant determination for continuous variables, a Hamming 
kernel as presented in ll22l for categorical variables and slice sampling for learning the model hyper¬ 
parameters (kernel, warping, etc.). However, the suggested method has also been tested with other 
models such as Student-t processes, other kernels, etc. Due to the computational burden of MCMC 

1 The code will be available as GPL once the paper gets published. 
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Figure 4: Michalewicz 10D function with m=10. On the left, there is a visualization of the first 
and second dimensions to appreciate the difficulty of this function. The number of local minima 
increases exponentially with the number of dimensions. Note that the time is in hours. 


for the hyperparameters, we have used a small number of samples (10), while trying to decorrelate 
every resample by large burn-in periods (100 samples) following the convention in liT8l . 

We compare standard Bayesian Optimization (BO), Bayesian Optimization with Active hyperpa¬ 
rameter learning (ABO), Spartan Bayesian Optimization (SBO), Spartan Bayesian Optimization 
with Active hyperparameter learning (ASBO), Simultaneous Perturbation Bayesian Optimization 
(SPBO). We also implemented the input warping (WARP) of Snoek et al. 113 which, to the authors 
knowledge, is currently the only Bayesian optimization algorithm specific for nonstationary func¬ 
tions. All experiments were repeated between 10 and 25 times, depending on the problem, using 
common random number to reduce the sampling error between algorithms. All the experiments 
include 5-10 initial samples generated through latin hypercube sampling, not included in the plot. 

5.1 Optimization benchmarks 

We have evaluated the algorithms on a set of standard functions for global optimization. Although 
popular in the literature, we have avoided the use of the Branin-Hoo and Camelback functions as 
there is barely room from improvement using “vanilla” BayesOpt urn 

Figure [ 2 ] shows the results of optimizing the exponential 2D function /(x) = x\ exp(— xf — * 2 ) 
for Xi,X 2 £ [—2,18] 2 from |5]- The use of classical stationary models (BO) results in a poor 
convergence because of the high nonstationarity of the function. Using an active strategy speed up 
the results, requiring slightly fewer iterations. Clearly, nonstationary methods, such as Snoek et 
al [19| (WARP) and the proposed method (SBO) result in an improved convergence. In fact, the 
convergence of the proposed method requires a number of iterations so small that actively learning 
the kernel hyperparameters results in a waste of samples. 

The computation times of the different algorithms are clearly driven by the dimensionality and shape 
of the distribution of the MCMC. The cost of the active component is negligible when compared to 
their passive counterpart. We found that learning the Beta parameters for the warping function was 
extremely slow as many samples were rejected. This could be alleviated with a different MCMC 
algorithm such as hybrid Monte Carlo or sequential Monte Carlo samplers. However, note that we 
use slice sampling as recommended by the authors fI3 . 

For the Hardmann 6D function (shown in Figure |3j, the differences are not statistically significant, 
which shows that when the function is stationary, nonstationary methods are no worse than standard 
Bayesian optimization. Furthermore, we can see that at the end they are more robusts. 

The Michalewicz function is known to be one of the hardest benchmarks in surrogate based global 
optimization. Figure [4] shows the results. The active algorithms as well as the SPBO have been 
removed as they were statistically indifferent from their passive counterpart. 

5.2 Surrogate benchmarks for machine learning problems 

Our next set of experiments is based on well known benchmarks for automatic tunning of machine 
learning problems. However, in order to simplify the evaluation, we have used the surrogate bench¬ 
marks presented in |4). Among all the available benchmarks we have selected the Gradient Boosting 
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Figure 5: Surrogate benchmarks g] based on real hyperparameter optimization of machine learning 
algorithms. From left to right: logistic regression (4D continuous), online LDA (3D continuous) and 
deep neural network (HP-NNET with the mrbi dataset, 7D continuous, 7D categorical). Top row: 
loss functions. Bottom row: computational time, in seconds. 


as it provides the lowest RMSE with respect to the actual problem. We explicitly avoid the Gaussian 
process surrogate to avoid the advantage of perfectly modeling. The results are shown in Figure [5] 
For clarity, we show only the statistically significant results apart from BO, local and warp. 

First, the logistic regression is an easy problem for Bayesian optimization. Even the simple Bayesian 
optimization can reach the minimum in less than 40 iterations. In this case, the warped method is 
the fastest one, with less than 10 iterations. However, the proposed method is comparable specially 
when using active learning, by a fraction of the total cost. It is important to note that, although 
Bayesian optimization is intended for expensive functions and the cost per iteration is negligible, 
we are talking of minutes to hours for single iteration of the warped method in a standard laptop. 
For the onlineFDA problem, the warped method gets stuck like the standard Bayesian optimization, 
while our method is able to escape from the local minima. Surprisingly, the SPBO method is the 
best in this example, even though the computational cost is the lowest by a large margin. This shows 
that for certain applications and structured problems, gradient information is fundamental. 

Finally, we evaluate the HP-NNET problem. In this case, due to the high dimensionality and hetero¬ 
geneity of the input space (7 continuous + 7 categorical parameters) we have applied the hierarchical 
model presented in Section [4] Also, in order to reduce the computational cost, the nonstationary 
(warping or local kernel) is applied only on the continuous variables (outer loop). Note that, in this 
case, the plots are with respect to target function evaluations. In this case, due to the complexity of 
the problem, the local kernel fails to converge at an early stage. However, with more data available, 
the local kernel jumps to a good spot and the convergence is faster. 


6 Conclusions 


We have presented a new algorithm called Spartan Bayesian Optimization which combines a local 
and a global kernel to deal with nonstationarity in Bayesian optimization. Besides, we have shown 
that the model can improve convergence speed even in stationary problems. The suggested algorithm 
achieves similar or better results than the state of the art by a small fraction of the computational cost. 
In addition to that, we have presented some ideas to improve the results of any Bayesian optimization 
algorithm: by actively learning the surrogate hyperparameters, by efficiently estimating the gradient 
of the target function and by building a hierarchical model to reduce the heterogeneity in the input 
space. 
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